A promising idea for scaling robot learning to more complex tasks is to use elemental
behaviors as building blocks to compose more complex behavior. Ideally, such building 
blocks are used in combination with a learning algorithm that is able to learn to select, 
adapt, sequence and co-activate the building blocks. While there has been a lot of work 
on approaches that support one of these requirements, no learning algorithm exists that 
unifies all these properties in one framework. In this paper we present our work on a 
unified approach for learning such a modular control architecture. We introduce new policy 
search algorithms that are based on information-theoretic principles and are able to learn 
to select, adapt and sequence the building blocks. Furthermore, we developed a new 
representation for the individual building block that supports co-activation and principled 
ways for adapting the movement. Finally, we summarize our experiments for learning 
modular control architectures in simulation and with real robots. 

Keywords: robotics, policy search, modularity, movement primitives, motor control, hierarchical reinforcement 
learning 



1. INTRODUCTION 

Robot learning approaches such as policy search methods (Kober 
and Peters, 2010; Kormushev et al., 2010; Theodorou et al., 2010) 
have been very successful. Kormushev et al. (2010) learned to flip
pancakes and Kober and Peters (2010) learned the game
ball-in-the-cup. Despite these impressive applications, robot learning
still offers many challenges due to the inherent high-dimensional
continuous state and action spaces, the high costs of generating
new data with the real robot, the partial observability of the
environment and the risk of damaging the robot due to overly
aggressive exploration strategies. These challenges have, so far,
prevented robot learning methods from scaling to more complex
real-world tasks.

However, many motor tasks are heavily structured. Exploiting 
such structures may well be the key to scale robot learning to 
more complex real world domains. One of the most common 
structures of a motor task is modularity. Many motor tasks can 
be decomposed into elemental movements or movement primitives
(Schaal et al., 2003; Khansari-Zadeh and Billard, 2011; Rozo
et al., 2013) that are used as building blocks in a modular control
architecture. For example, playing tennis can be decomposed into 
single stroke-based movements, such as a forehand and a back- 
hand stroke. To this end, we need a learning architecture that 
learns to select, improve, adapt, sequence and co-activate the ele- 
mental building blocks. Adaptation is needed as such building 
blocks are only useful if they can be reused for a wide range of 
situations, and, hence the building block needs to be adapted to 
the current situation. For example, for playing tennis, the ball will 
always approach the player slightly differently. Furthermore, we 
need to learn how to sequence such parametrized building blocks. 
Taking up our tennis example, we need to execute a sequence of 
strokes such that the opponent player cannot return the ball in
the long run. For sequencing the building blocks, we ideally want
to be able to continuously switch from one building block to the 
next to avoid abrupt transitions, also called "blending" of build- 
ing blocks. Finally, co-activation of the building blocks would 
considerably increase the expressiveness of the control architecture.
Coming back to the tennis example, co-activating primitives
that are responsible for the upper body movement, i.e., the stroke, 
and primitives that are responsible for the movement of the lower 
body, i.e., making a side step or a forward step would significantly 
reduce the number of required building blocks. 

In this paper we present an overview of our work that
concentrates on learning such modular control architectures by
reinforcement learning. We developed new policy search methods
that can select and adapt the individual building blocks
to the current situation, learn and improve a large number of
different building blocks, and learn how to sequence
building blocks to solve a complex task. Our learning architecture
is based on an information-theoretic policy search algorithm
called Relative Entropy Policy Search (REPS) proposed by Peters 
et al. (2010). The main insight used by REPS is that the relative 
entropy between the trajectory distributions of two subsequent 
policies during policy search should be bounded. This bound is 
particularly useful in robotics as it can cope with many of the 
mentioned challenges of robot learning. It decreases the danger of 
damaging the robot as the policy updates stay close to the "data" 
generated by the old policy and do not perform wild exploration. 
Moreover, it results in a smooth learning process and prevents the 
algorithm from getting stuck prematurely in local minima even 
for high dimensional parameter spaces that are typically used in 
robotics (Peters and Schaal, 2008; Daniel et al., 2012a). While
there are several other policy search approaches which can learn
either the selection (da Silva et al., 2012), the adaptation (Kober et al.,
2010b; Ude et al., 2010) or the sequencing (Stulp and Schaal,
2011) of individual building blocks, to the best of our knowledge,
our approach offers the first framework that unifies all these 
properties in a principled way. 

A common way to implement the building blocks is to 
use movement primitives (MPs). Movement primitives provide 
a compact representation of elemental movements by either 
parameterizing the trajectory (Schaal et al., 2003; Neumann, 
2011; Rozo et al., 2013), muscle activation profiles (d'Avella
and Pai, 2010) or directly the control policy (Khansari-Zadeh 
and Billard, 2011). All of these representations offer several 
advantages, such as the ability to learn the MP from demon- 
stration (Schaal et al., 2003; Rozo et al., 2013), global stability 
properties (Schaal et al., 2003), co-activation of multiple primitives
(d'Avella and Pai, 2010), or adaptability of the representation
via hyper-parameter tuning (Schaal et al., 2003; Rozo et al.,
2013). However, none of these approaches unifies all the desirable
properties of an MP in one framework. We therefore introduced
a new MP representation that is particularly well suited to be 
used in a modular control architecture. Our MP representation is 
based on distributions over trajectories and is called Probabilistic 
Movement Primitive (ProMP). It can, therefore, represent the 
variance profile of the resulting trajectories, which allows us to 
encode the importance of time points as well as represent opti- 
mal behavior in stochastic systems (Todorov and Jordan, 2002). 
However, the most important benefit of a probabilistic representation
is that we can apply probabilistic operators to trajectory
distributions, i.e., conditioning for adaptation of the MP and a
product of distributions for co-activation and blending of MPs.
Yet, such a probabilistic representation is of little use if we can- 
not use it to control the robot. Therefore, we showed that a 
stochastic time-varying feedback controller can be obtained ana- 
lytically, enabling us to use the probabilistic movement primitive 
approach as a promising future representation of a building block 
in modular control architectures. We will present experiments on 
several real robot tasks such as playing tether-ball and shooting a 
hockey puck. The robots used for the experiments are illustrated 
in Figure 1. 

1.1. RELATED WORK 

1.1.1. Movement representations

FIGURE 1 | (Left) The Barrett WAM playing the game of tetherball. (Right)
The KUKA lightweight arm playing a modified version of hockey.

Different elemental movement representations have been proposed
in the literature. The most prominent one is the dynamic
movement primitive (DMP) approach (Ijspeert and Schaal, 2003;
Schaal et al., 2003). DMPs encode a movement in a parametrized
dynamical system. The dynamical system is implemented as a
second order spring damper system which is perturbed by a
non-linear forcing function $f$. The forcing function depends
non-linearly on the phase variable $z_t$, which denotes a clock for the
movement. The evolution of the phase variable can be made
faster or slower by the temporal scaling factor $\tau$, which in turn
also changes the execution speed of the movement. The forcing
function is linearly parametrized by a parameter vector $w$ and
can be easily learned from demonstrations. In addition to the 
high dimensional parameters w, we can adjust meta-parameters 
of the DMPs such as the goal attractor g of the spring-damper 
system and temporal scaling factor. In Kober et al. (2010a), the 
DMPs have been extended to include the final desired velocity 
in its meta-parameters. DMPs have several advantages. They are 
easy to learn from demonstrations and by reinforcement learning, 
they can be used for rhythmic and stroke-based movements and 
they have built-in stability guarantees. However, they also suffer
from some disadvantages. They cannot represent optimal behavior
in a stochastic environment. In addition, the generalization to
a new end position is based on heuristics and not learned from
demonstrations, and it is not clear how DMPs can be combined
simultaneously. Several other movement primitive representations
have been proposed in the literature. Some of them are based on
DMPs to overcome their limitations (Calinon et al., 2007; Rozo
et al., 2013), but none of them can overcome all the limitations
in one framework. Rozo et al. (2013) estimate a time-varying
feedback controller for the DMPs; however, the way this feedback
controller is obtained is based on heuristics. They also implement
a combination of primitives as a product of GMMs which is sim- 
ilar to the work presented here on the probabilistic movement 
primitives. However, this approach is lacking a principled way of 
determining a feedback controller that exactly matches the trajec- 
tory distribution. Therefore, it is not clear what the result of this 
product is if we apply the resulting controller on the robot. 
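The DMP structure described above (a spring-damper system driven by a phase-dependent forcing function) can be sketched in a few lines. This is only an illustrative one-dimensional version: the function name `dmp_rollout`, the gains `alpha`, `beta` and `alpha_z`, and the placement and width of the basis functions are assumptions made for this sketch, not values taken from the cited work.

```python
import math

def dmp_rollout(w, y0, g, tau=1.0, dt=0.001, alpha=25.0, beta=6.25, alpha_z=3.0):
    """Integrate a one-dimensional DMP with simple Euler-type steps.

    w: weights of the forcing function, y0: start position, g: goal attractor,
    tau: temporal scaling factor. Gains and basis placement are illustrative.
    """
    n = len(w)
    # Basis centers spread along the decaying phase variable (heuristic choice).
    centers = [math.exp(-alpha_z * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centers]
    y, yd, z = y0, 0.0, 1.0
    trajectory = [y]
    for _ in range(int(round(tau / dt))):
        psi = [math.exp(-h * (z - c) ** 2) for h, c in zip(widths, centers)]
        # Forcing function: non-linear in the phase z, linear in the weights w.
        f = z * sum(p * wi for p, wi in zip(psi, w)) / (sum(psi) + 1e-10)
        # Spring-damper system perturbed by f; tau scales the execution speed.
        ydd = (alpha * (beta * (g - y) - tau * yd) + f) / tau ** 2
        yd += ydd * dt
        y += yd * dt
        # The phase variable acts as a clock that decays from 1 towards 0.
        z += -alpha_z * z / tau * dt
        trajectory.append(y)
    return trajectory
```

With all weights set to zero the system reduces to the plain spring-damper and simply converges to the goal `g`; non-zero weights reshape the path taken on the way there, and changing `g` or `tau` corresponds to adjusting the meta-parameters mentioned in the text.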

Most of the movement representations explicitly depend on 
time (Ijspeert and Schaal, 2003; Neumann and Peters, 2009; 
Paraschos et al., 2013; Rozo et al., 2013). For time-dependent rep- 
resentations, a linear controller is often sufficient to model com- 
plex behavior as the non-linearity is induced by the time depen- 
dency. In contrast, time-independent models such as the Stable 
Estimator of Dynamical Systems (SEDS) approach (Khansari- 
Zadeh and Billard, 2011) directly estimate a state dependent 
policy that is independent of time. Such models require more 
complex, non-linear controllers. For example, the SEDS approach 
uses a GMM to model the policy. The GMM is estimated such that 
the resulting policy is provably stable. Due to the simplicity of
the policy, time-dependent representations can be easily scaled up 
to higher dimensions as shown by Ijspeert and Schaal (2003). Due 
to the increased complexity, time-independent models are typi- 
cally used for lower dimensional movements such as modeling the 
movement directly in task space. Yet, a time-independent model is 
the more general representation as it does not require the knowl- 
edge of the current time step. In this paper, we will nevertheless 
concentrate on time-dependent movement representations.

1.1.2. Policy search 

The most common reinforcement learning approach to learn the 
parameters of an elemental movement representation such as a 
DMP is policy search (Williams, 1992; Peters and Schaal, 2008;
Kober and Peters, 2010; Kober et al., 2010a). The goal of policy 
search is to find a parameter vector of the policy such that the 
resulting policy optimizes the expected long-term reward. Many 
policy search methods use a stochastic policy for exploration. 
They can be coarsely categorized according to their policy update
strategy. Policy gradient methods (Williams, 1992; Peters et al., 
2003) are one of the earliest policy update strategies that were 
applied to motor primitive representations. They estimate the 
gradient of the expected long-term reward with respect to the pol- 
icy parameters (Williams, 1992) and update the policy parameters 
in the direction of this gradient. The main disadvantages of policy
gradient methods are the necessity to specify a hand-tuned learning
rate, the poor learning speed, and the fact that, without sample
re-use, many samples are typically required to obtain a new policy.

More recent approaches rely on probabilistic methods. These
methods typically base their derivation on the expectation-
maximization algorithm (Vlassis et al., 2009; Kober and Peters,
2010) and formulate the policy search problem as an inference
problem by transforming the reward into an improper probability
distribution, i.e., the transformed reward is required to be
always positive. Such a transformation is typically achieved by an
exponential transformation with a hand-tuned temperature. The
resulting policy update can be formulated as a weighted model 
fitting task where each sample is weighted by the transformed 
long-term rewards (Kober and Peters, 2010). Using a probabilis- 
tic model fitting approach to compute the policy update results 
in the important advantage that we can use a big toolbox of algo- 
rithms for estimating structured probabilistic models, such as the 
expectation maximization algorithm (Dempster et al., 1977) or
variational inference (Neal and Hinton, 1998). Additionally, it 
does not require a user specified learning rate. These approaches 
typically directly explore in the parameter space of the policy 
by estimating a distribution over the policy parameters. Such an
approach works well if we have a moderate number of parameters.
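As a rough illustration of the weighted model fitting described in this paragraph, the following sketch transforms returns with an exponential and a temperature, then refits a one-dimensional Gaussian over the policy parameter. The function name and the scalar setting are assumptions made for brevity; the cited methods work with full parameter vectors.

```python
import math

def reward_weighted_gaussian_update(thetas, rewards, temperature):
    """Fit a 1-D Gaussian to samples weighted by exp(reward / temperature)."""
    r_max = max(rewards)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp((r - r_max) / temperature) for r in rewards]
    total = sum(weights)
    # Weighted maximum-likelihood estimates of mean and variance.
    mean = sum(wi * t for wi, t in zip(weights, thetas)) / total
    var = sum(wi * (t - mean) ** 2 for wi, t in zip(weights, thetas)) / total
    return mean, var

# Toy usage: the reward peaks at theta = 2, so the new mean moves towards 2.
thetas = [0.0, 1.0, 2.0, 3.0]
rewards = [-(t - 2.0) ** 2 for t in thetas]
mean, var = reward_weighted_gaussian_update(thetas, rewards, temperature=1.0)
```

A smaller temperature makes the update greedier (the best sample dominates), while a larger one keeps the new distribution close to the sampling distribution, which is why this quantity has to be tuned by hand in these methods.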

Another algorithm that has recently gained a lot of attention
is the policy improvement by path integrals (PI$^2$) algorithm
(Theodorou et al., 2010; Stulp and Sigaud, 2012). The path integral
theory allows us to compute the globally optimal trajectory
distribution along with the optimal controls without requiring
a value function, as opposed to traditional dynamic programming
approaches. However, the current algorithm is limited to
learning open-loop policies (Theodorou et al., 2010; Stulp and
Sigaud, 2012) and may not be able to adapt the variance of
the exploration policy (Theodorou et al., 2010).

1.1.3. Generalization of skills

An important requirement in a modular control architecture is 
that we can adapt a building block to the current situation or task. 
We will describe a task or a situation with a context vector s. The 
context vector can contain the objectives of the agent, e.g., throw- 
ing a ball to a desired target location, or physical properties of the 
environment, e.g., the mass of the ball to throw. Ude et al. (2010) 
use supervised learning to generalize movement primitives from 
a set of demonstrations. Such an approach is well suited to generalize
a set of demonstrations to new situations, but cannot be used
to improve the skills upon the demonstration. To alleviate this
limitation, da Silva et al. (2012) combine low-dimensional sub-space
extraction for generalization and policy search methods for
policy improvement. Finding such low-dimensional sub-spaces is 
an interesting idea that can considerably improve the generaliza- 
tion of the skills. Yet, there is one important limitation of the 
approach presented in da Silva et al. (2012). The algorithms for 
policy improvement and skill generalization work almost independently
from each other. The only way they interact is that
the generalization is used as initialization for the policy search
algorithm when a new task needs to be learned. As a consequence,
the method needs to create many roll-outs for the same
task/context in order to improve the skill for this context. Such a
limitation is relaxed by contextual policy search methods (Kober
et al., 2010b; Neumann, 2011). Contextual policy search methods
explicitly learn a policy that chooses the control parameters $\theta$ in
accordance with the context vector s. Therefore, a different context
can be used for each roll-out. Kober et al. (2010b) use a Gaussian
Process (GP) for generalization. While GPs have good generalization
properties, they are of limited use for policy search as they
typically learn an uncorrelated exploration policy. The approach
in Neumann (2011) can use a directed exploration strategy, but it
suffers from high computational demands.

1.1.4. Sequencing of skills

Another requirement is to learn to sequence the building 
blocks. Standard policy search methods typically choose a sin- 
gle parameter vector per episode. Hence, such methods can be 
used to learn the parameters of a single building block. In order to 
sequence building blocks, we have to learn how to choose multi- 
ple parameter vectors per episode. The first approach (Neumann 
and Peters, 2009) for learning to sequence primitives was based 
on value-function approximation techniques, which restricted its 
application to a rather small set of parameters for each primitive.
Recently, Stulp and Schaal (2011) adapted the path integral
approach to policy search to sequence movement primitives. 
Other approaches (Morimoto and Doya, 2001; Ghavamzadeh 
and Mahadevan, 2003) use hand-specified sub-tasks to learn the 
sequencing of elemental skills. Such an approach is limited in its 
flexibility of the resulting policy and the sub-tasks are typically 
not easy to define manually. 

1.1.5. Segmentation and modular imitation learning

Segmentation (Kulic et al., 2009; Alvarez et al., 2010; Meier et al.,
2011) and modular imitation learning (Niekum et al., 2012) address the
important and challenging problem of autonomously extracting
the structure of the modular control policy from demonstrations.
In Meier et al. (2011) and Alvarez et al. (2010), the segmentation
is based on parameter changes in the dynamical system that
is supposed to have created the motion. In Chiappa and Peters
(2010), Bayesian methods are used to construct a library of build- 
ing blocks. Repeated skills are modeled to be generated by one 
of the building-blocks, which are rescaled and noisy. Based on 
the segmentation of the demonstrations, we can infer the single 
building blocks from the data by clustering the segments. One 
approach that integrates clustering and segmentation is to use 
Hidden Markov Models (HMMs). Williams and Storkey (2007)
used an HMM to extract movement primitives from hand-writing
data. While this is a very general approach, it has only been applied to
rather low-dimensional data, i.e., 2-D movements. Niekum et al.
(2012) use a beta-process autoregressive HMM to estimate the
segmentation, which has the advantage that the number of building
blocks can also be inferred from data. DMPs are used to represent
the policy of the single segments. Butterfield et al. (2010) use an
HMM to directly estimate the policy. For each hidden state, they
fit a Gaussian Process model to represent the policy of this hidden
state. The advantage of these imitation learning approaches
is that we can also estimate the temporal structure of the modular
control policy, i.e., when to switch from one building block
to the next. So far, such imitation learning approaches have not 
been integrated in a reinforcement learning framework, which 
seems to be a very interesting direction. For example, in current 
reinforcement learning approaches, the duration of the building 
blocks is specified by a single parameter. Estimating the duration 
of the building blocks from the given trajectory data seems to be 
a fruitful and more general approach. 

2. INFORMATION THEORETIC POLICY SEARCH FOR 
LEARNING MODULAR CONTROL POLICIES 

In this section we will sequentially introduce our information the- 
oretic policy search framework used for learning modular control 
policies. We start our discussion with the adaptation of a single 
building block. Subsequently, we discuss how to learn to select a 
building block and, finally, we will discuss sequencing of building 
blocks. 

After introducing each component of our framework, we 
briefly discuss related experiments on real robots and in simulation.
In this paper, we can only give a brief overview of
the experiments. For more details, we refer to the correspond- 
ing papers. In our experiments with our information theoretic 
policy search framework, we used Dynamic Movement Primitives 
(DMP) introduced in Schaal et al. (2003) as building blocks in our 
modular control architecture. In all our experiments, we used the 
hyper-parameters of a DMP as parameters of the building blocks, 
such as the final positions and velocities of the joints (Kober et al., 
2010a) as well as the temporal scaling factor of the DMPs for 
changing the execution speed of the movement. 

2.1. LEARNING TO ADAPT THE INDIVIDUAL BUILDING BLOCKS 

We formulate the learning of the adaptation of the building
blocks as a contextual policy search problem (Kober et al., 2010b;
Neumann, 2011; Daniel et al., 2012a), where we will for now
assume that we want to execute only a single building block.
Adaptation of a building block is implemented by an upper-level
policy $\pi(\theta|s)$ that chooses the parameter vector $\theta$ of the building
block according to the current context vector $s$. The context
describes the task. It might contain objectives of the agent or
properties of the environment, for example, the incoming velocity
of a tennis ball. After choosing the parameters $\theta$, the lower-level
policy $u_t = f(x_t, \theta)$ of the building block takes over and is used
to control the robot. For example, such a lower-level policy can be
defined by a trajectory tracking controller that tracks the desired
trajectory of a dynamic movement primitive (DMP) (Schaal et al., 2003).
Note that we use the symbol $x_t$ to denote the
state of the robot. The state $x_t$ typically contains the joint angles
$q_t$ and joint velocities $\dot{q}_t$ of the robot and it should not be confused
with the context vector $s$. The context vector $s$ describes
the task and contains higher-level objectives of the agent.

Our aim is to learn an upper-level policy that maximizes the 
expected reward 

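$$
\begin{aligned}
J_\pi &= \iint \mu(s)\,\pi(\theta|s)\,R(s,\theta)\,ds\,d\theta, \\
R(s,\theta) &= \int p(\tau|s,\theta)\,r(\tau,s)\,d\tau,
\end{aligned}
\tag{1}
$$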

where $R(s,\theta)$ is the expected reward of the resulting trajectory
$\tau$ when using parameters $\theta$ in context $s$ and $\mu(s)$ denotes the
distribution over the contexts that is specified by the learning
problem. The distribution $p(\tau|s,\theta)$ denotes the probability of a
trajectory given $s$ and $\theta$, and $r(\tau,s)$ is a user-specified reward
function that depends on the trajectory $\tau$ and on the context $s$. We
use the Relative Entropy Policy Search (REPS) algorithm (Peters
et al., 2010) as the underlying policy search method. The basic idea of
REPS is to bound the relative entropy between the old and the new
parameter distribution. Here, we will consider the episode-based
contextual formulation of REPS (Daniel et al., 2012a; Kupcsik
et al., 2013) that is tailored for learning such an upper-level
policy. The policy update step is defined as a constrained optimization
problem where we want to find the distribution $p(s,\theta) = \mu(s)\pi(\theta|s)$
that maximizes the average reward given in Eq. 1
with respect to $p(s,\theta)$ and simultaneously satisfies several constraints.
We will first discuss these constraints and show how to
compute $p(s,\theta)$. Subsequently, we will explain how to obtain the
upper-level policy $\pi(\theta|s)$ from $p(s,\theta)$.

Generally, we initialize any policy search (PS) method with
an initial policy $q_0(s,\theta) = \mu(s)q_0(\theta|s)$, either obtained through
learning from demonstration or by manually setting a distribution
for the parameters. The variance of the initial distribution
$q_0(s,\theta)$ defines the exploration region. Policy search is an iterative
process. Given the sampling distribution $q_0(s,\theta)$, we obtain a new
distribution $p_1(s,\theta)$. Subsequently, $p_1$ is used as the new sampling
policy $q_1$ and the process is repeated.

PS methods need to find a trade-off between keeping the ini- 
tial exploration and constricting the policy to a (typically local) 
optimum. In REPS, this trade-off is realized via the Kullback- 
Leibler (KL) divergence. REPS maximizes the reward under the 
constraint that the KL-divergence to the old exploration policy is 
bounded, i.e., 

$$\epsilon \geq \mathrm{KL}\big(p(s,\theta)\,\|\,q(s,\theta)\big). \tag{2}$$

Due to this bound, we can choose between exploitation with the
greedy policy (high KL-bound) or continued exploration with the
old exploration policy (very small KL-bound). The KL divergence
in REPS bounds not only the conditional probability $\pi(\theta|s)$, i.e.,
the differences in the policies, but also the joint state-action
probabilities $p(s,\theta)$ to ensure that the observed state-action region
does not change rapidly over iterations, which is paramount for a
real robot learning algorithm. Using the (asymmetric) KL divergence
$\mathrm{KL}(p(s,\theta)\,\|\,q(s,\theta))$ allows us to find a closed-form solution
of the algorithm. Such a closed form would not be possible with the
opposite KL divergence, i.e., $\mathrm{KL}(q(s,\theta)\,\|\,p(s,\theta))$.
FIGURE 2 | The figure illustrates the joint trajectories that can be
generated when using a linear Gaussian to adapt the DMP parameters
according to a one dimensional context variable. In this illustration, we
show the color coding for the context variable in the color bar on the right
and show how the generated trajectories change in the main plot. For this
plot, we assumed no exploration noise and adapted ten basis functions of
the DMP. As we can see, complex behavior can emerge already with a
linear adaptation model due to the high-dimensionality of the parameter 
space. 



We also have to consider that the context distribution
$p(s) = \int p(s,\theta)\,d\theta$ cannot be freely chosen by the agent as it is
specified by the learning problem and given by $\mu(s)$. Hence, we need
to add the constraints $\forall s: p(s) = \mu(s)$ to match the given context
distribution $\mu(s)$. However, for a continuous context vector
$s$, we would end up with infinitely many constraints. Therefore,
we resort to matching feature averages instead of single probability
values, i.e., $\int p(s)\phi(s)\,ds = \hat{\phi}$, where $\phi(s)$ is a feature
vector describing the context and $\hat{\phi}$ is the mean observed
feature vector.

The resulting constrained optimization problem is now 
given by 

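$$
\begin{aligned}
\max_{p}\ & \iint p(s,\theta)\,R(s,\theta)\,ds\,d\theta, \\
\text{s.t.}\ & \epsilon \geq \mathrm{KL}\big(p(s,\theta)\,\|\,q(s,\theta)\big), \\
& \int p(s)\,\phi(s)\,ds = \hat{\phi}, \qquad \iint p(s,\theta)\,ds\,d\theta = 1.
\end{aligned}
\tag{3}
$$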

It can be solved by the method of Lagrangian multipliers and
yields a closed-form solution for $p$ that is given by

$$p(s,\theta) \propto q(s,\theta)\exp\left(\frac{R(s,\theta) - V(s)}{\eta}\right), \tag{4}$$

where $V(s) = v^T\phi(s)$ is a context-dependent baseline that is subtracted
from the reward signal. The scalar $\eta$ and the vector $v$
are Lagrangian multipliers that can be found by optimizing the
dual function $g(\eta, v)$ (Daniel et al., 2012a). It can be shown that
$V(s)$ can be interpreted as a value function (Peters et al., 2010)
and, hence, estimates the mean performance of the new policy
in context $s$.
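The dual function itself is not written out in this overview. For orientation only, the sample-based form that is commonly stated for this contextual formulation of REPS is reproduced below; it has not been re-derived here, so the cited papers (Daniel et al., 2012a; Kupcsik et al., 2013) remain the authoritative source. Here $R^{[i]}$ and $s^{[i]}$ denote the return and context of the $i$-th of $N$ roll-outs.

$$
g(\eta, v) = \eta\,\epsilon + v^T\hat{\phi} + \eta \log\left(\frac{1}{N}\sum_{i=1}^{N}\exp\left(\frac{R^{[i]} - v^T\phi(s^{[i]})}{\eta}\right)\right), \qquad \eta > 0.
$$

Under this form, the dual is minimized with respect to $\eta$ and $v$, and the resulting multipliers are inserted into the closed-form solution for $p(s,\theta)$.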

The optimization defined by the REPS algorithm is only performed
on a discrete set of samples $D = \{s^{[i]}, \theta^{[i]}, R^{[i]}\}$, $i = 1, \dots, N$,
where $R^{[i]}$ denotes the return obtained by the $i$-th roll-out.
The resulting probabilities $p(s^{[i]}, \theta^{[i]})$, see Equation (4), of
these samples are used to weight the samples. In order to obtain
the weight $p^{[i]}$ for each sample, we need to divide $p(s^{[i]}, \theta^{[i]})$
by the sampling distribution $q(s^{[i]}, \theta^{[i]})$ to account for the sampling
probability (Kupcsik et al., 2013), i.e.,

$$p^{[i]} = \frac{p(s^{[i]}, \theta^{[i]})}{q(s^{[i]}, \theta^{[i]})} \propto \exp\left(\frac{R^{[i]} - V(s^{[i]})}{\eta}\right). \tag{5}$$

Hence, being able to sample from $q$ is sufficient and $q$ is not
needed in its analytical form.
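The weighting step can be illustrated with a short sketch. It assumes that the multipliers $\eta$ and $v$ have already been obtained from the dual optimization; the function names `reps_weights` and `kl_to_uniform` and the list-based data layout are hypothetical choices for this example.

```python
import math

def reps_weights(returns, features, eta, v):
    """Normalized sample weights proportional to exp((R[i] - v^T phi(s[i])) / eta)."""
    # Advantage of each roll-out over the context-dependent baseline V(s) = v^T phi(s).
    advantages = [r - sum(vk * pk for vk, pk in zip(v, phi))
                  for r, phi in zip(returns, features)]
    a_max = max(advantages)  # constant shift for stability; cancels after normalization
    unnormalized = [math.exp((a - a_max) / eta) for a in advantages]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def kl_to_uniform(weights):
    """KL divergence between the reweighted samples and uniform sample weights."""
    n = len(weights)
    return sum(w * math.log(w * n) for w in weights if w > 0.0)
```

The second helper gives a quick check of how far a given $\eta$ moves the weighted samples away from the sampling distribution: a small $\eta$ yields peaked weights and a large divergence, a large $\eta$ yields nearly uniform weights.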

The upper-level policy $\pi(\theta|s)$ is subsequently obtained by
performing a weighted maximum-likelihood (ML) estimate. We
use a linear-Gaussian model to represent the upper-level policy
$\pi(\theta|s) = \mathcal{N}(\theta\,|\,a + As, \Sigma)$ of the building block, where the
parameters $a$, $A$ and $\Sigma$ are obtained through the ML estimation.
As a building block is typically reused only for similar
contexts $s$, a linear model is sufficient in most cases. Figure 2
shows an illustration of how a linear model can adapt the
trajectories generated by a DMP. In practice, we still need
an initial policy $q$. This initial policy can either be obtained
through learning from demonstration or by selecting reasonable
parameters and variance if the experimenter has sufficient task
knowledge.
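A minimal sketch of the weighted maximum-likelihood fit is given below. To stay self-contained it assumes a scalar context and a scalar parameter, so that $a$, $A$ and $\Sigma$ are plain numbers; the vector-valued case follows the same weighted least-squares pattern. The function name is hypothetical.

```python
def fit_linear_gaussian(contexts, thetas, weights):
    """Weighted ML estimate of theta ~ N(a + A * s, sigma2) for scalar s and theta.

    Assumes the contexts are not all identical (otherwise the slope is undefined).
    """
    total = sum(weights)
    s_mean = sum(w * s for w, s in zip(weights, contexts)) / total
    t_mean = sum(w * t for w, t in zip(weights, thetas)) / total
    cov_st = sum(w * (s - s_mean) * (t - t_mean)
                 for w, s, t in zip(weights, contexts, thetas)) / total
    var_s = sum(w * (s - s_mean) ** 2 for w, s in zip(weights, contexts)) / total
    A = cov_st / var_s        # slope: how the parameter adapts to the context
    a = t_mean - A * s_mean   # offset
    # Weighted residual variance defines the exploration noise of the new policy.
    sigma2 = sum(w * (t - a - A * s) ** 2
                 for w, s, t in zip(weights, contexts, thetas)) / total
    return a, A, sigma2
```

In an iteration of the algorithm, the weights passed to this function would be the $p^{[i]}$ of the roll-outs, so that well-performing samples dominate the new mean and shrink the variance.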

In Kupcsik et al. (2013), we further improved the data-
efficiency of our contextual policy search algorithm by learning
probabilistic forward models of the real robot and its environment.
With these forward models, we can predict the reward
$R(s^{[j]}, \theta^{[j]})$ for unseen context-parameter pairs $s^{[j]}$ and $\theta^{[j]}$ and
use these additional samples for computing the policy update.
The data-efficiency of our method could be improved by up to two
orders of magnitude using the learned forward models. As we
used Gaussian Processes (GPs) (Rasmussen and Williams, 2006) 
to represent the forward models, this extension of our method 
is called GPREPS. These forward models were used to generate 
additional data points that are used for the policy update. For each 
of these virtual data points, we generated 15 trajectories with the 
learned forward models. We used the average reward of these pre- 
dicted trajectories as reward used in the REPS optimization. We 
used sparse GPs (Snelson and Ghahramani, 2006) to deal with 
the high number of data points within a reasonable computation 
time. 
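The generation of virtual samples can be outlined as follows. This is only a structural sketch: the callables `sample_context`, `sample_theta`, `sample_trajectory` and `reward_fn` are hypothetical placeholders for the context distribution, the current upper-level policy, the learned forward models and the reward function, and no GP machinery is implemented here.

```python
def predicted_reward(context, theta, sample_trajectory, reward_fn, n_trajectories=15):
    """Average reward over trajectories sampled from a learned forward model."""
    rewards = [reward_fn(sample_trajectory(context, theta), context)
               for _ in range(n_trajectories)]
    return sum(rewards) / len(rewards)

def generate_virtual_samples(sample_context, sample_theta, sample_trajectory,
                             reward_fn, n_samples):
    """Create (context, parameter, predicted reward) triples without robot roll-outs."""
    samples = []
    for _ in range(n_samples):
        s = sample_context()
        theta = sample_theta(s)
        reward = predicted_reward(s, theta, sample_trajectory, reward_fn)
        samples.append((s, theta, reward))
    return samples
```

The resulting triples play the same role as real roll-outs in the REPS optimization; the default of 15 trajectories per virtual data point mirrors the number stated in the text.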

2.1.1. Experimental evaluation of the adaptation of building
blocks - robot hockey target shooting 

In this task we used GPREPS with learned forward models to 
learn how to adapt the building blocks such that the robot can 
shoot hockey pucks to different locations. The objective was to 
make a target puck move for a specified distance by shooting a 
second hockey puck at the target puck. The context $s$ was composed
of the initial location $[b_x, b_y]^T$ of the target puck and
the distance $d^*$ that the target puck had to be shot, i.e.,
$s = [b_x, b_y, d^*]^T$. We chose the initial position of the target puck to be
uniformly distributed from the robot's base with displacements
$b_x \in [1.5, 2.5]$ m and $b_y \in [0.5, 1]$ m. The desired displacement
context parameter $d^*$ is also uniformly distributed, $d^* \in [0, 1]$ m.
The reward function

$$r(\tau, s) = -\min_t \|x_t - b\|^2 - \|d_T - d^*\|^2$$

consists of two terms with equal weighting. The first term penalizes
missing the target puck located at position $b = [b_x, b_y]^T$, where
the control puck trajectory is $x_{1:T}$. The second term penalizes the
error in the desired displacement of the target puck, where $d_T$ is
the resulting displacement of the target puck after the shot. The
parameters $\theta$ define the weights and goal position of the DMP.
The policy in this experiment was a linear Gaussian policy. The 
simulated robot task is depicted in Figure 3. 
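The reward of this task is simple enough to write down directly. The sketch below assumes that the control puck trajectory is available as a list of planar positions and that the displacement of the target puck is a scalar; the function and argument names are hypothetical.

```python
def hockey_reward(control_puck_positions, target_position, displacement,
                  desired_displacement):
    """r = -min_t ||x_t - b||^2 - (d_T - d*)^2 with equal weighting of both terms."""
    bx, by = target_position
    # First term: closest squared distance between the control puck and the target puck.
    min_sq_dist = min((x - bx) ** 2 + (y - by) ** 2
                      for x, y in control_puck_positions)
    # Second term: squared error in the displacement of the target puck.
    return -min_sq_dist - (displacement - desired_displacement) ** 2
```

A shot whose trajectory passes exactly through the target position and moves the target puck by exactly the desired distance receives the maximal reward of zero.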

GPREPS first learned a forward model to predict the initial 
position and velocity of the first puck after contact with the racket 
and a travel distance of 20 cm. Subsequently, GPREPS learned the 
free dynamics model of both pucks and the contact model of the 
pucks. We assumed that we know the geometry of the pucks to 
detect a contact. If there is a contact, we used the contact model to 
predict the state of both pucks after the contact given the state of 
both pucks before the contact. From this state, we again predicted 
the final puck positions after they came to a stop with a separate GP
model.

We compared GPREPS in simulation to directly predicting
the reward $R(s, \theta)$, model-free REPS and CrKR (Kober et al.,
2010b), a state-of-the-art model-free contextual policy search
method. The resulting learning curves are shown in Figure 3
(middle). GPREPS learned the task after only 120 interactions
with the environment, while the model-free version of REPS
needed approximately 10000 interactions. Directly predicting the
rewards from the parameters $\theta$ using a single GP model resulted
in faster convergence, but the resulting policies still showed poor
performance (GP direct). The results show that CrKR could not
compete with model-free REPS. The learned movement is shown
in Figure 3 for a specific context. After 100 evaluations, GPREPS
placed the target puck accurately at the desired distance with an
error < 5 cm.

Finally, we evaluated the performance of GPREPS on the
hockey task using a real KUKA lightweight arm. The learning
curve of this experiment is shown in Figure 3 (right) and
confirms that GPREPS can find high-quality policies within a small
number of interactions with the environment.



2.2. LEARNING TO SELECT THE BUILDING BLOCKS 

In order to select between several building blocks $o$, we add an
additional level of hierarchy on top of the upper-level policies
of the individual building blocks. We assume that each building
block shares the same parameter space. The parameters are now
selected by first choosing the building block to execute with a
gating policy $\pi_G(o|s)$ and, subsequently, the upper-level parameter
policy $\pi_P(\theta|s, o)$ of the building block $o$ selects the
parameters $\theta$. Hence, $\pi(\theta|s)$ can be written as the
hierarchical policy

$$\pi(\theta|s) = \sum_o \pi_G(o|s)\, \pi_P(\theta|s, o). \qquad (6)$$

In this model, the gating policy composes a complex, non-linear
parameter selection strategy out of the simpler upper-level
policies of the building blocks. Moreover, it can learn multiple
solutions for the same context, which also increases the
versatility of the learned motor skill (Daniel et al., 2012b). While a
similar decomposition into a gating policy and option policies has
been presented in da Silva et al. (2012), their framework was not
integrated in a reinforcement learning algorithm, and hence,
generalization and improvement of the building blocks are performed
by two independent algorithms, resulting in sample-inefficient
policy updates.
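
To make the two-stage selection concrete, the sketch below draws one sample from a hierarchical policy: first an option from the gating policy, then the parameters from that option's linear Gaussian upper-level policy. The data layout (a `gating` callable that returns option probabilities, and per-option tuples `(a, A, std)` with a diagonal covariance) is a hypothetical choice for illustration only.

```python
import random

def sample_hierarchical(s, gating, options, rng):
    """Draw (o, theta) from pi(theta|s) = sum_o pi_G(o|s) pi_P(theta|s, o).

    s       : context vector (list of floats)
    gating  : callable returning the probabilities pi_G(o|s) for all options
    options : list of (a, A, std) tuples, one linear Gaussian policy per option
              (diagonal covariance is an assumption of this sketch)
    rng     : random.Random instance
    """
    probs = gating(s)
    # Stage 1: choose the building block with the gating policy.
    o = rng.choices(range(len(options)), weights=probs, k=1)[0]
    a, A, std = options[o]
    # Stage 2: the chosen option's upper-level policy selects the parameters.
    mean = [a_i + sum(A_ij * s_j for A_ij, s_j in zip(row, s))
            for a_i, row in zip(a, A)]
    theta = [rng.gauss(m, sd) for m, sd in zip(mean, std)]
    return o, theta
```

Because each option only has to be accurate in its own region of the context space, the mixture as a whole can represent a non-linear mapping from context to parameters.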

To incorporate multiple building blocks, we now bound the
Kullback-Leibler divergence between $q(s, \theta, o)$ and
$p(s, \theta, o)$. As we are interested in versatile solutions, we also
want to avoid that several building blocks concentrate on the same
solution. Hence, we want to limit the "overlap" between building
blocks in the parameter space. In order to do so, we
bound the expected entropy of the conditional distribution
$p(o|s, \theta)$, i.e.,

$$-\int p(s, \theta) \sum_o p(o|s, \theta) \log p(o|s, \theta)\, ds\, d\theta \le \kappa. \qquad (7)$$

FIGURE 3 | (Left) Robot hockey target shooting task. The robot has to shoot
a puck at the target puck such that the target puck moves for a specified
distance. Both the initial location of the target puck $[b_x, b_y]^T$ and the
desired distance $d^*$ to move the puck were varied. (Middle) Learning curves
on the robot hockey task in simulation. GPREPS was able to learn the task
within 120 interactions with the environment, while the model-free version of
REPS needed about 10000 episodes. (Right) GPREPS learning curve on the real
robot arm.

A low entropy of $p(o|s, \theta)$ ensures that our building blocks do
not overlap in parameter space and, thus, represent individual
and clearly separated solutions (Daniel et al., 2012a). The new
optimization program results in the hierarchical version of REPS,
denoted as HiREPS. We can again determine a closed form solution
for $p(s, \theta, o)$, which is given in Daniel et al. (2012a). As in the
previous section, the optimization problem is only solved for a
given set of samples that has been generated from the distribution
$q(s, \theta)$. Subsequently, the parameters of the gating policy and the
upper-level policies are obtained by weighted ML estimates. We
use a Gaussian gating policy and an individual linear Gaussian
policy $\pi(\theta|s, o) = \mathcal{N}(\theta \mid a_o + A_o s, \Sigma_o)$ for
each building block. As we use a linear upper-level policy and the
DMPs we use produce only locally valid controllers, our architecture
might require a large number of building blocks.
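
The weighted ML step can be sketched for the simplest case of a scalar context and a scalar parameter, where the weighted least-squares solution has a closed form. Restricting to scalars, and passing the per-sample weights in as a plain list, are simplifying assumptions of this example; in the hierarchical setting the weight of a sample for one option would also include that option's responsibility for the sample.

```python
def weighted_ml_linear_gaussian(contexts, thetas, weights):
    """Weighted ML fit of theta ~ N(a + A * s, var) for scalar s and theta.

    Returns the offset a, the gain A and the variance var that maximize the
    weighted log-likelihood of the samples (contexts[i], thetas[i]).
    """
    W = sum(weights)
    s_bar = sum(w * s for w, s in zip(weights, contexts)) / W
    t_bar = sum(w * t for w, t in zip(weights, thetas)) / W
    # Weighted covariance between context and parameter, and context variance.
    cov_st = sum(w * (s - s_bar) * (t - t_bar)
                 for w, s, t in zip(weights, contexts, thetas)) / W
    var_s = sum(w * (s - s_bar) ** 2 for w, s in zip(weights, contexts)) / W
    A = cov_st / var_s
    a = t_bar - A * s_bar
    # Weighted residual variance around the fitted line.
    var = sum(w * (t - a - A * s) ** 2
              for w, s, t in zip(weights, contexts, thetas)) / W
    return a, A, var
```

Samples with a large weight dominate the fit, so the new policy moves toward the high-reward samples while still being estimated from the whole sample set.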

2.2.1. Experimental evaluation of the selection of building blocks -
robot tetherball

In robot tetherball, the robot has to shoot a ball that is fixed 
with a string on the ceiling such that it winds around a pole. 
The robot obtains a reward proportional to the speed of the ball 
winding around the pole. There are two different solutions: winding
the ball around the left or around the right side of the pole.
Two successful hitting movements of the real robot are shown in
Figure 5. We decompose our movement into a swing-in motion
and a hitting motion. As we used the non-sequential algorithm
for this experiment, we represented the two motions by a single
set of parameters and jointly learned the parameters $\theta$ for the
two DMPs. We start the policy search algorithm with 15 options
with randomly distributed parameters sampled from a Gaussian 
distribution around the parameters of the initial demonstration. 
We use a higher number of building blocks to increase the prob- 
ability of finding both solutions with the building blocks. If 
we use two randomly initialized building blocks, the probabil- 
ity that both cover the same solution is quite high. We delete 
unused building blocks that have a very small probability of 
being chosen, i.e., p(o) < 0.001. The learning curve is shown 
in Figure 4 (left). The noisy reward signal is mostly due to the 
vision system and partly also due to real world effects such as 
friction. Two resulting movements of the robot are shown in 
Figure 5. The robot could learn a versatile strategy that con- 
tained building blocks that wind the ball around the left and 
building blocks that wind the ball around the right side of 
the pole. 
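
The pruning rule for unused building blocks can be sketched as follows. The dictionary representation of $p(o)$ and the renormalization of the remaining probabilities are assumptions of this example; the text only states that options with $p(o) < 0.001$ are deleted.

```python
def prune_options(option_probs, threshold=1e-3):
    """Drop options whose probability p(o) is below the threshold.

    option_probs : dict mapping an option label to its probability p(o)
    Returns a new dict with the remaining options, renormalized to sum to one
    (the renormalization is an assumption of this sketch).
    """
    kept = {o: p for o, p in option_probs.items() if p >= threshold}
    total = sum(kept.values())
    return {o: p / total for o, p in kept.items()}
```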



2.3. LEARNING TO SEQUENCE THE BUILDING BLOCKS 

To execute multiple building blocks in a sequence, we
reformulate the problem of sequencing building blocks as a Markov
Decision Process (MDP). Each building block defines a transition
probability $p(s'|s, \theta)$ over future contexts and an immediate
reward function $R(s, \theta)$. It is executed until its termination
condition $t_o(s, \theta)$ is satisfied. However, in our experiments, we
used a fixed duration for each building block. Note that traditional
reinforcement learning methods, such as TD-learning, cannot
deal with such MDPs as their action space is high dimensional and
continuous.

We concentrate on the finite-horizon case, i.e., each episode
consists of $K$ decision steps where each step is defined as the
execution of an individual building block. For clarity, we will only
discuss the sequencing of a single building block; however, the
selection of multiple building blocks at each decision step can be
easily incorporated (Daniel et al., 2013).

In the finite horizon formulation of REPS we want to find
the probabilities $p_k(s, \theta) = p_k(s)\pi_k(\theta|s)$, $k \le K$, and
$p_{K+1}(s)$ that maximize the expected long term reward

$$J = \int_s p_{K+1}(s) R_{K+1}(s)\, ds + \sum_{k=1}^{K} \iint_{s,\theta} p_k(s, \theta) R_k(s, \theta)\, ds\, d\theta,$$

where $R_{K+1}(s_{K+1})$ denotes the final reward for ending up in the
state $s_{K+1}$ after executing the last building block. As in the
previous case, the initial context distribution is given by the task,
i.e., $\forall s: p_1(s) = \mu_1(s)$. Furthermore, the context
distributions at future decision steps $k > 1$ need to be consistent with
the past distributions $p_{k-1}(s, \theta)$ and the transition model
$p(s'|s, \theta)$, i.e.,

$$\forall s', k > 1: \quad p_k(s') = \iint_{s,\theta} p_{k-1}(s, \theta)\, p(s'|s, \theta)\, ds\, d\theta,$$

for each decision step of the episode. These constraints connect
the policies for the individual decision steps and result in a
policy $\pi_k(\theta|s)$ that optimizes the long-term reward instead of
the immediate ones. As in the previous sections, these constraints are
again implemented by matching feature averages.

The closed form solution of the joint distribution $p_k(s, \theta)$
yields

$$p_k(s, \theta) \propto q_k(s, \theta) \exp\left(\frac{A_k(s, \theta)}{\eta_k}\right),$$

$$A_k(s, \theta) = R_k(s, \theta) + E_{p(s'|s,\theta)}\left[V_{k+1}(s')\right] - V_k(s).$$

We can see that the reward $R_k(s, \theta)$ is transformed into an
advantage function $A_k(s, \theta)$ where the advantage now also depends
on the expected value of the next state $E_{p(s'|s,\theta)}[V_{k+1}(s')]$.
This term ensures that we do not just optimize the immediate reward
but the long term reward.
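
A sample-based version of this weighting can be sketched as follows, assuming the exponential form $\exp(A_k/\eta)$ with a temperature $\eta > 0$ and assuming that the expectation over the next state is approximated by the value of the single observed successor state. The function name, the normalization of the weights and the subtraction of the maximum advantage for numerical stability are choices made for this example.

```python
import math

def step_weights(rewards, next_values, values, eta):
    """Per-sample weights proportional to exp(A_k / eta) for one decision step.

    rewards     : R_k(s_i, theta_i) for each sample i
    next_values : V_{k+1}(s'_i), a one-sample estimate of E[V_{k+1}(s')]
    values      : V_k(s_i)
    eta         : temperature of the exponential weighting (assumed given)
    """
    # Advantage: immediate reward plus expected future value minus baseline.
    adv = [r + vn - v for r, vn, v in zip(rewards, next_values, values)]
    a_max = max(adv)  # shift by the maximum to avoid overflow in exp
    w = [math.exp((a - a_max) / eta) for a in adv]
    total = sum(w)
    return [x / total for x in w]
```

Two samples with the same immediate reward therefore receive different weights if one of them leads to a more valuable successor state, which is how the long-term reward enters the policy update.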

2.3.1. Experimental evaluation of sequencing of building blocks -
sequential robot hockey

We used the sequential robot hockey task to evaluate our sequential
motor skill learning framework. The robot has to move the target
puck into one of three target areas by sequentially shooting a
control puck at the target puck. The target areas are defined by a
specified distance to the robot, see Figure 6 (left). The robot gets
rewards of 1, 2, and 3 for reaching zone 1, 2 or 3, respectively.
After each shot, the control puck is returned to the robot. The
target puck, however, is only reset after every third shot.

FIGURE 4 | Average rewards for learning tetherball on the real robot.
Mean and standard deviation of three trials. In all of the three trials, after 50
iterations the robot has found solutions to wind the ball around the pole on
either side.

FIGURE 5 | Time series of a successful swing of the robot. The robot first has to swing the ball to the pole and, subsequently, when the ball has swung
backwards, can arc the ball around the pole. The movement is shown for a shot to the left and to the right of the pole.

FIGURE 6 | (Left) The sequential robot hockey task. The robot has two
pucks, the pink control puck and the yellow target puck. The task is to
shoot the yellow target puck into one of the colored reward zones. Since
the best reward zone is too far away from the robot to be reached with
only one shot, each episode consists of three strikes. After each strike,
the control puck is returned to the robot, but the target puck is only reset
after one episode is concluded. (Middle) Comparison of sequential motor
primitive learning to the episodic learning setup on the simulated robot
hockey task. The sequential motor primitive learning framework was able
to find a good strategy to place the puck in the third reward zone in most
of the cases while the episodic learning scenario failed to learn such a
strategy. (Right) One trial of the real robot hockey task. The robot starts
with a negative initial reward and learns to achieve an average reward of
2.5 after 300 episodes.

The 2-dimensional position of the target puck defines the
context $s$ of the task and the parameter vector $\theta$ defines the goal
positions of the DMP that defines the desired trajectory of the
robot's joints. After performing one shot, the agent observes the
new context to plan the subsequent shot. In order to give the agent
an incentive to shoot at the target puck, we punished the agent
with the negative minimum distance of the control puck to the
target puck after each shot. While this reward was given after every
step, the zone reward was only given at the end of the episode
(every third step) as $r_{K+1}(s_{K+1})$.

We compared our sequential motor primitive learning method
with its episodic variant on a realistic simulation. For the episodic
variant we used one extended parameter vector $\theta$ that contained
the parameters for all three hockey shots. The comparison of
both methods can be seen in Figure 6 (middle). Due to the
high-dimensional parameter space, the episodic learning setup
failed to learn a proper policy while our sequential motor
primitive learning framework could learn policies of much higher
quality.

On the real robot, we could reproduce the simulation results. 
The robot learned a strategy which could move the target puck to 
the highest reward zone in most of the cases after 300 episodes. 
The learning curve is shown in Figure 6 (right). 

3. PROBABILISTIC MOVEMENT PRIMITIVES 

In the second part of this paper, we investigate new
representations for the individual building blocks of movements that are
particularly suited to be used in a modular control architecture.
In all experiments for our modular policy search framework,
we have so far used the Dynamic Movement Primitive (DMP)
approach (Schaal et al., 2003). DMPs are widely used; however,
when used in our modular control architecture, they suffer
from severe limitations, as they do not support co-activation
or blending of building blocks. In addition, DMPs use
heuristics for the adaptation of the motion. Hence, we focus
our discussion of our new movement primitive (MP)
representation (Paraschos et al., 2013) on these two important
properties.

We use a trajectory $\tau = \{q_t\}_{t=0 \ldots T}$, defined by the joint
angles $q_t$ over time, to model a single movement. We will use a
probabilistic representation of a movement, which we call
probabilistic movement primitives (ProMP), where a movement primitive
describes several ways how to execute a movement (Paraschos
et al., 2013). Hence, the movement primitive is given as a
distribution $p(\tau)$ over trajectories. A probabilistic representation
offers several advantages that make it particularly suitable to be used
in a modular control architecture. Most importantly, it offers
principled ways to adapt as well as to co-activate movement
primitives. Yet, these advantages of a probabilistic trajectory
representation are of little use if we cannot use it to control the
robot. Therefore, we derive a stochastic feedback controller in
closed form that can exactly reproduce a given trajectory
distribution, and, hence, trajectory distributions can be used directly for
robot control.

In this section, we present two experiments that we performed 
with the ProMP approach. As we focused on the representation 
of the individual building blocks, we evaluated the new repre- 
sentation without the use of reinforcement learning and learned 
the ProMPs by imitation. In our experiments, we illustrate how 
to use conditioning as well as co-activation of the building 
blocks. 

3.1. PROBABILISTIC TRAJECTORY REPRESENTATION 

In the imitation learning setup, we assume that we are given several
demonstrations in terms of trajectories $\tau_i$. In our probabilistic
approach we want to learn a distribution of these trajectories. We 
will first explain the basic representation of a trajectory distribu- 
tion and subsequently cover the two new operations that are now 
available in our probabilistic framework, i.e., conditioning and 
co-activation. Finally, we will explain in Section 3.3 how to con- 
trol the robot with a stochastic feedback controller that exactly 
reproduces the given trajectory distribution. 

We use a weight vector $w$ to compactly represent a single
trajectory $\tau$. The probability of observing a trajectory $\tau$ given
the weight vector $w$ is given as a linear basis function model
$p(\tau|w) = \prod_t \mathcal{N}(y_t \mid \Phi_t^T w, \Sigma_y)$, where
$y_t = [q_t, \dot{q}_t]^T$ contains the joint position $q_t$ and joint
velocity $\dot{q}_t$, $\Phi_t = [\psi_t, \dot{\psi}_t]$ defines the
time-dependent basis matrix and $\epsilon_y \sim \mathcal{N}(0, \Sigma_y)$
is zero-mean i.i.d. Gaussian noise.

We now abstract a distribution over trajectories as a distribution
$p(w; \theta)$ over the weight vector $w$ that is parametrized by the
parameter vector $\theta$. The original trajectory distribution
$p(\tau; \theta)$ can now be computed by marginalizing out the weight
vector $w$, i.e., $p(\tau; \theta) = \int p(\tau|w) p(w; \theta)\, dw$. We
will assume a Gaussian distribution
$p(w; \theta) = \mathcal{N}(w \mid \mu_w, \Sigma_w)$ and, hence,
$p(\tau; \theta)$ can be computed analytically, i.e.,

$$p(y_t; \theta) = \mathcal{N}\left(y_t \mid \Phi_t^T \mu_w, \; \Phi_t^T \Sigma_w \Phi_t + \Sigma_y\right).$$

As a probabilistic MP represents multiple ways to execute an
elemental movement, we also need multiple demonstrations to learn
$p(w; \theta)$. The parameters $\theta = \{\mu_w, \Sigma_w\}$ can be learned
by maximum likelihood estimation, for example, by using the expectation
maximization algorithm (Lazaric and Ghavamzadeh, 2010).
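
As a small worked example of the marginal distribution, the sketch below computes the mean and variance of the joint position at one time step from the weight distribution. It assumes a single joint, considers only the position row of the basis matrix (a basis vector `phi`), and uses plain nested lists for the covariance; these are simplifications made for illustration.

```python
def promp_marginal(phi, mu_w, Sigma_w, sigma_y2):
    """Mean and variance of the position at one time step under a ProMP.

    phi      : basis function values at this time step (list of floats)
    mu_w     : mean of the weight distribution
    Sigma_w  : covariance of the weight distribution (list of rows)
    sigma_y2 : observation noise variance of the position
    """
    # Mean: phi^T mu_w
    mean = sum(p * m for p, m in zip(phi, mu_w))
    # Variance: phi^T Sigma_w phi + sigma_y^2
    Sigma_phi = [sum(S_ij * p for S_ij, p in zip(row, phi)) for row in Sigma_w]
    var = sum(p * sp for p, sp in zip(phi, Sigma_phi)) + sigma_y2
    return mean, var
```

Evaluating this function for every time step yields the mean trajectory together with a variance tube around it.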

For multi-dimensional systems, we can also learn the coupling
between the joints. Coupling is typically represented by the
covariance of the joint positions and velocities. We can learn this
covariance by maintaining a parameter vector $w_i$ for each
dimension $i$ and learning a distribution over the combined weight vector

$$w = [w_1^T, \ldots, w_n^T]^T.$$

To be able to adapt the execution speed of the movement,
we introduce a phase variable $z$ to decouple the movement from
the time signal (Schaal et al., 2003). The phase can be any
function $z(t)$ monotonically increasing with time. The basis functions
$\psi_t$ are now decoupled from the time and depend on the phase,
such that $\psi_t = \psi(z_t)$ and $\dot{\psi}_t = \psi'(z_t)\dot{z}_t$.
The choice of the basis functions depends on whether we want to model
rhythmic movements, where we use normalized Von-Mises basis functions
that are periodic in the phase, or stroke-based movements, where we
use normalized Gaussian basis functions,

$$\phi_i^{G}(z) = \exp\left(-\frac{(z_t - c_i)^2}{2h}\right),$$

$$\phi_i^{VM}(z) = \exp\left(h \cos\left(2\pi (z_t - c_i)\right)\right). \qquad (8)$$

The parameter $h$ defines the width of the basis and $c_i$ the center
of the $i$th basis function. We normalize the basis functions $\phi_i$
with $\psi_i(z_t) = \phi_i(z) / \sum_j \phi_j(z)$.
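
The normalized Gaussian basis for stroke-based movements can be sketched as follows, assuming the unnormalized basis has the form $\exp(-(z - c_i)^2 / (2h))$ with centers $c_i$ and width $h$. The function name and the example centers are hypothetical.

```python
import math

def normalized_gaussian_basis(z, centers, h):
    """Normalized Gaussian basis functions evaluated at phase z.

    Computes phi_i(z) = exp(-(z - c_i)^2 / (2 h)) for every center c_i and
    divides by the sum over all basis functions, so the result sums to one.
    """
    phi = [math.exp(-(z - c) ** 2 / (2.0 * h)) for c in centers]
    total = sum(phi)
    return [p / total for p in phi]

# Example: three basis functions spread over the phase interval [0, 1].
psi = normalized_gaussian_basis(0.5, [0.0, 0.5, 1.0], 0.05)
```

The normalization makes the basis activations a convex combination at every phase value, which keeps the magnitude of the weights comparable across the movement.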

3.2. NEW PROBABILISTIC OPERATORS FOR MOVEMENT PRIMITIVES 

The probabilistic formulation of MPs enables us to use new prob- 
abilistic operators on our movement primitive representation. 
Adaptation of the movement can be accomplished by condition- 
ing on desired positions or velocities at time step t. Co-activation 
and blending of MPs can be implemented as a product of two
trajectory distributions.

3.2.1. Adaptation of the building blocks by conditioning

For efficient adaptation, our building blocks should support 
the modulation of hyper-parameters of the movements such as 
the desired final joint positions or the joint positions at given 
via-points. For example, DMPs allow for the adaptation of the 
final position by modulation of the point attractor of the sys- 
tem. However, how the final position modulates the trajectory 
is hard-coded in the DMP-framework and cannot be learned
from data. This adaptation mechanism might violate other task 
constraints. 

In our probabilistic formulation, such adaptation operations
can be described by conditioning the MP to reach a certain state
$y_t^*$ at time $t$. Conditioning can be performed by adding a new
desired observation $x_t^* = \{y_t^*, \Sigma_y^*\}$ to our probabilistic
model, where $y_t^*$ represents the desired position and velocity vector
at time $t$ and $\Sigma_y^*$ specifies the accuracy of the desired
observation. By applying Bayes theorem, we obtain a new distribution over
$w$, i.e., $p(w|x_t^*) \propto \mathcal{N}(y_t^* \mid \Phi_t^T w, \Sigma_y^*)\, p(w)$.
As $p(w; \theta)$ is Gaussian, the conditional distribution $p(w|x_t^*)$ is
also Gaussian and can be computed analytically

$$\mu_w^{[new]} = \mu_w + \Sigma_w \Phi_t \left(\Sigma_y^* + \Phi_t^T \Sigma_w \Phi_t\right)^{-1} \left(y_t^* - \Phi_t^T \mu_w\right), \qquad (9)$$

$$\Sigma_w^{[new]} = \Sigma_w - \Sigma_w \Phi_t \left(\Sigma_y^* + \Phi_t^T \Sigma_w \Phi_t\right)^{-1} \Phi_t^T \Sigma_w. \qquad (10)$$

We illustrate conditioning a ProMP to different target states
in Figure 7A. As we can see, the modulation of a target state
is also learned from demonstration, i.e., the ProMP will choose 
a new trajectory distribution that goes through the target state, 
and, at the same time, is similar to the learned trajectory 
distribution. 
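
The conditioning update is the standard Gaussian conditioning rule. The sketch below implements it for the special case of a scalar position observation, where the matrix inverse reduces to a division. Restricting the observation to one position value with basis vector `phi`, and the use of nested lists for the covariance, are assumptions of this example.

```python
def condition_promp(mu_w, Sigma_w, phi, y_star, sigma_star2):
    """Condition N(mu_w, Sigma_w) on reaching position y_star at one time step.

    phi         : basis function values at the conditioning time step
    y_star      : desired position
    sigma_star2 : variance expressing the accuracy of the desired observation
    Assumes Sigma_w is symmetric, as every covariance matrix is.
    """
    n = len(mu_w)
    # Sigma_w phi, used in both the gain and the covariance update.
    Sigma_phi = [sum(Sigma_w[i][j] * phi[j] for j in range(n)) for i in range(n)]
    # Scalar term sigma*^2 + phi^T Sigma_w phi that is inverted in the update.
    s = sigma_star2 + sum(phi[i] * Sigma_phi[i] for i in range(n))
    gain = [sp / s for sp in Sigma_phi]
    # Difference between the desired and the currently predicted position.
    resid = y_star - sum(phi[i] * mu_w[i] for i in range(n))
    mu_new = [mu_w[i] + gain[i] * resid for i in range(n)]
    Sigma_new = [[Sigma_w[i][j] - gain[i] * Sigma_phi[j] for j in range(n)]
                 for i in range(n)]
    return mu_new, Sigma_new
```

With a very small `sigma_star2` the conditioned mean trajectory passes almost exactly through `y_star`, while weights that are uncorrelated with the observed basis function keep their original variance.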

3.2.2. Combination and blending by multiplying distributions 

In our probabilistic representation, a single MP represents a
whole family of movements. Co-activating two MPs should
return a new set of movements which are contained in both MPs.
Such an operation can be performed by multiplying two
distributions. We also want to weight the activation of each primitive
$o_i$ by a time-varying activation factor $\alpha_i(t)$, for example, to
continuously blend the movement execution from one primitive to
the next. The activation factors can be implemented by taking
the distributions of the individual primitives to the power
of $\alpha_i(t)$. Hence, the co-activation of ProMPs yields

$$p^*(\tau) \propto \prod_t \prod_i p_i(y_t)^{\alpha_i(t)}.$$

For Gaussian distributions
$p_i(y_t) = \mathcal{N}(y_t \mid \mu_t^{[i]}, \Sigma_t^{[i]})$, the
resulting distribution $p^*(y_t)$ is again Gaussian and we can
obtain its mean $\mu_t^*$ and variance $\Sigma_t^*$ analytically with
variance and mean

$$\Sigma_t^* = \Big(\sum_i \alpha_i(t) \big(\Sigma_t^{[i]}\big)^{-1}\Big)^{-1}, \qquad \mu_t^* = \Sigma_t^* \sum_i \alpha_i(t) \big(\Sigma_t^{[i]}\big)^{-1} \mu_t^{[i]}. \qquad (11)$$

Both terms are required to obtain the stochastic feedback con- 
troller that is finally used to control the robot. We illustrated 
co-activating two ProMPs in Figure 7B and blending of two 
ProMPs in Figure 7C. 
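
For one-dimensional Gaussians the product of powers has a simple closed form: raising a Gaussian to the power $\alpha_i$ scales its precision by $\alpha_i$, and multiplying Gaussians adds their precisions. The sketch below applies this to a single time step. The scalar setting and the function name are assumptions made for illustration.

```python
def coactivate(means, variances, alphas):
    """Moments of the product of 1-D Gaussians, each raised to the power alpha_i.

    means, variances : per-primitive mean and variance at one time step
    alphas           : activation factors alpha_i(t) at that time step
    """
    # Precisions add up, each scaled by its activation factor.
    precision = sum(a / v for a, v in zip(alphas, variances))
    var_star = 1.0 / precision
    # The mean is the precision-weighted average of the individual means.
    mean_star = var_star * sum(a * m / v
                               for a, m, v in zip(alphas, means, variances))
    return mean_star, var_star
```

With equal activations the result lies between the two primitives and is more certain than either of them. Setting one activation to zero recovers the other primitive, which is what makes smooth blending between primitives possible.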

3.3. USING TRAJECTORY DISTRIBUTIONS FOR ROBOT CONTROL 

In order to use a trajectory distribution $p(\tau; \theta)$ for robot
control, we have to obtain a controller which can exactly reproduce the
given distribution. As we show in Paraschos et al. (2013), such a
controller can be obtained in closed form if we know the system
dynamics $\dot{y} = f(y, u) + \epsilon_y$ of the robot$^1$. We model the
controller as a time-varying stochastic linear feedback controller, i.e.,
$u_t = k_t + K_t y_t + \epsilon_u$, where $k_t$ denotes the feed-forward
gains, $K_t$ the feedback gains and
$\epsilon_u \sim \mathcal{N}(0, \Sigma_u)$ the controller noise.
Hence, the controller is determined by $k_t$, $K_t$ and $\Sigma_u$ for each
time point. All these terms can be obtained analytically by predicting
the distribution $p_{\text{model}}(y_{t+dt})$ from $p(y_t; \theta)$ with the
known model of the system dynamics and subsequently matching the
moments of $p(y_{t+dt}; \theta)$ and the moments of the predicted
distribution $p_{\text{model}}(y_{t+dt})$. The resulting controller exactly
reproduces the given trajectory distribution $p(\tau; \theta)$ (Paraschos
et al., 2013).

$^1$Alternatively, we can assume that we use inverse dynamics control on the
robot, and, hence, the idealized dynamics of the robot are given by a linear
system. Such an approach is, for example, followed by the DMPs that also
assume that the underlying dynamical system, which represents the robot, is
linear.

FIGURE 7 | (A) Conditioning on different target states. The blue shaded
area represents the learned trajectory distribution. We condition on
different target positions, indicated by the "x"-markers. The produced
trajectories exactly reach the desired targets while keeping the shape of
the demonstrations. (B) Combination of two ProMPs. The trajectory
distributions are indicated by the blue and red shaded areas. Both
primitives have to reach via-points at different points in time, indicated by
the "x"-markers. We co-activate both primitives with the same activation
factor. The trajectory distribution generated by the resulting feedback
controller now goes through all four via-points. (C) Blending of two
ProMPs. We smoothly blend from the red primitive to the blue primitive.
The activation factors are shown in the bottom. The resulting movement
(green) first follows the red primitive and, subsequently, switches to
following the blue primitive.

While the ProMP approach has many similarities to the
approach introduced in Rozo et al. (2013) by Calinon and
colleagues, there are also important differences.
They also learn a trajectory distribution, which is modeled with
a GMM, where the output variables are the joint angles and the
time step $t$. The probability for the joint angles at time step $t$ is
then obtained by conditioning on $t$. However, it is unclear how
to condition on being at a certain state $q^*$ at time step $t$, which
is very different from just conditioning on being in time step $t$.
In this case, the mixture components need to be changed such
that the trajectory distribution passes through $q^*$ at time step $t$.
How to implement this change with a GMM is an open problem.
Note that the ProMP approach is very different from a GMM.
It uses a linear basis function model and learns the correlation
of the parameters of the basis functions for the different
movements. Time is not modeled as a random variable but as a conditional
variable right away. Due to the learned correlations, we can
condition on reaching $q^*$ at time step $t$ and the trajectory distribution
smoothly passes through $q^*$ with high accuracy.

Furthermore, a trajectory distribution alone is not sufficient 
to control a robot as it requires a feedback controller that deter- 
mines the control actions. How to obtain this feedback controller 
from the trajectory distribution is based on heuristics in Rozo 
et al. (2013). That is, when we apply the feedback controller on the
real robot, we will not reproduce the learned trajectory distri- 
bution. The produced trajectory distribution might be similar, 
but we do not know how similar. Therefore, for all operations 
performed on the trajectory distributions (i.e., a combination of 
distributions by a product), it is hard to quantify the effect of this 
operation on the resulting motions that are obtained from the 
heuristic feedback controller. In contrast, the ProMPs come with 
a feedback controller that exactly matches the trajectory distri- 
bution. Hence, for a combination of distributions, we know that 
the feedback controller will exactly follow the product of the two 
distributions. 

3.3.1. Experimental evaluation of the combination of objectives at
different time-points

In this task, a seven link planar robot has to reach different target
positions in end-effector space at the final time point $t_T$ and at
a via-point $t_v$. We generated the demonstrations for learning the
MPs with an optimal control law (Toussaint, 2009) and by adding
noise to the control outputs. In the first set of demonstrations,
the robot reached a via-point at $t_1 = 0.25$ s with its end-effector.
We used 10 normalized Gaussian basis functions per joint, resulting
in a 70-dimensional weight vector. As we learned a single
distribution over all joints of the robot, we can also model the
correlations between the joints. These correlations are required to
learn to reach a desired via-point in task space. The reproduced
behavior with the ProMPs is illustrated in Figure 8 (top). The
ProMP exactly reproduced the via-points in task space. Moreover,
the ProMP exhibited the same variability in between the time
points of the via-points. It also reproduced the coupling of the
joints from the optimal control law, which can be seen by the
small variance of the end-effector in comparison to the rather
large variance of the single joints at the via-points. We also used a
second set of demonstrations where the first via-point was located
at time step $t_2 = 0.75$ s, which is illustrated in Figure 8 (middle).
We co-activated the ProMPs learned from both demonstrations.
The robot could accurately reach both via-points at $t_1 = 0.25$ s and
$t_2 = 0.75$ s, see Figure 8 (bottom).

FIGURE 8 | A 7-link planar robot has to reach a target position at
$T = 1.0$ s with its end-effector while passing a via-point at $t_1 = 0.25$ s
(top) or $t_2 = 0.75$ s (middle). The plot depicts the mean posture of the robot
at different time steps (black) and samples generated by the ProMP (gray).
The demonstrations have been generated by an optimal control law. The
ProMP approach was able to exactly reproduce the coupling of the joints from
the demonstrations. The combination of both learned ProMPs is shown in the
bottom. The resulting movement reached both via-points with high accuracy.

3.3.2. Experimental evaluation of the combination of simultaneous
objectives - robot hockey

In this task, the robot again has to shoot a hockey puck in different
directions and distances. The task setup can be seen in Figure 9A.
We recorded two different sets of demonstrations, one that
contains straight shots with varying distances, while the second set
contains shots with a varying shooting angle and almost constant
distance. Both data sets contained ten demonstrations each.
Sampling from the two models generated by the different data
sets yields shots that exhibit the demonstrated variance in either
angle or distance, as shown in Figures 9B,C. When combining the
data sets of both primitives and learning a new primitive, we get a
movement which exhibits variance in both dimensions, i.e., angle
and distance, see Figure 9D. When the two individual primitives
are combined by a product of MPs, the resulting model shoots
only in the center at medium distance, i.e., the intersection of both
MPs, see Figure 9E.

FIGURE 9 | Robot Hockey. The robot shoots a hockey puck. The setup
is shown in (A). We demonstrate ten straight shots for varying
distances and ten shots for varying angles. The pictures show samples
from the ProMP model for straight shots (B) and angled shots (C).
Learning from the combined data set yields a model that represents
variance in both distance and angle (D). Multiplying the individual
models leads to a model that only reproduces shots where both
models had probability mass, in the center at medium distance (E).
The last picture shows the effect of conditioning on only left or right
angles; the robot does not shoot in the center any more (F).

4. CONCLUSION AND FUTURE WORK

Using structured, modular control architectures is a promising
concept to scale robot learning to more complex real-world tasks.
In such a modular control architecture, elemental building blocks,
such as movement primitives, need to be adapted, sequenced or
co-activated simultaneously. In this paper, we presented a unified
data-efficient policy search framework that exploits such control
architectures for robot learning. Our policy search framework
can learn to select, adapt and sequence parametrized building
blocks such as movement primitives while coping with the main
challenges of robot learning, i.e., high dimensional, continuous
state and action spaces and the high costs of generating data.
Moreover, we presented a new probabilistic representation of the
individual building blocks which shows several beneficial
properties. Most importantly, it supports efficient and principled ways
of adapting a building block to the current situation, and we can
co-activate several of these building blocks.

Future work will concentrate on integrating the new ProMP
approach into our policy search framework. Interestingly, the
upper-level policy would in this case directly specify the
trajectory distribution. The lower level control policy is automatically
given by this trajectory distribution. We will also explore
incorporating the co-activation of individual building blocks in our
policy search framework. Additional future work will concentrate
on incorporating perceptual feedback into the building blocks
and using more complex hierarchies in policy search.